This is perhaps the best-known dataset for classification. It contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other two; the latter are NOT linearly separable from each other.

Predicted attribute: class of iris plant.

Attribute information:

  • sepal length in cm
  • sepal width in cm
  • petal length in cm
  • petal width in cm
  • class: Iris Setosa, Iris Versicolour, Iris Virginica.

Iris dataset in scikit-learn

scikit-learn includes this dataset, so we can load it directly:


In [9]:
from sklearn import datasets 
iris = datasets.load_iris()
X = iris.data
y = iris.target
X.shape


Out[9]:
(150, 4)

In [10]:
y.shape


Out[10]:
(150,)
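
The loaded Bunch object also carries the feature and class names listed above; printing them is a quick sanity check:

In [ ]:
# Inspect the feature and target names bundled with the dataset
print(iris.feature_names)
print(iris.target_names)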

The dataset is suitable for a classification task. Using supervised learning, some algorithms that can be applied are (a minimal sketch for the latter two follows this list):

  • Random Forest
  • Logistic Regression
  • Naive Bayes
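
Either of the latter two can be fit in a few lines. A minimal, self-contained sketch on an 80/20 split (exact accuracies will vary with the split and library version):

In [ ]:
# A minimal sketch: fit Logistic Regression and Naive Bayes on an 80/20 split
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

data = datasets.load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(data.data, data.target,
                                          test_size=0.2, random_state=0)

for model in (LogisticRegression(max_iter=200), GaussianNB()):
    model.fit(X_tr, y_tr)
    print(type(model).__name__, model.score(X_te, y_te))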

In [11]:
# import pandas, a data processing and CSV file I/O library
import pandas as pd 

# import seaborn, a Python graphing library
import warnings # current version of seaborn generates a bunch of warnings that we'll ignore
warnings.filterwarnings("ignore")
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="white", color_codes=True)

%matplotlib inline

# Load the Iris flower dataset in CSV format
iris2 = pd.read_csv("dataset/iris.csv")

# Quick check
iris2.head()


Out[11]:
   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa

In [12]:
# How many samples do we have of each species?
iris2["species"].value_counts()


Out[12]:
setosa        50
versicolor    50
virginica     50
Name: species, dtype: int64

In [13]:
# The first way to plot things is using the .plot extension from Pandas dataframes
iris2.plot(kind="scatter", x="sepal_length", y="sepal_width")


Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0xa5fb608c>

In [21]:
# Use seaborn to produce a similar plot
# A seaborn jointplot shows bivariate scatterplots and univariate histograms in the same figure
sns.jointplot(x="sepal_length", y="sepal_width", data=iris2, size=5)


Out[21]:
<seaborn.axisgrid.JointGrid at 0xa5a8bf8c>

In [23]:
# Now use seaborn's FacetGrid to color the scatterplot by species 
sns.FacetGrid(iris2, hue="species", size=5).map(plt.scatter, "sepal_length", "sepal_width").add_legend()


Out[23]:
<seaborn.axisgrid.FacetGrid at 0xa5727dac>

In [24]:
# Use a boxplot to visualize an individual feature
sns.boxplot(x="species", y="petal_length", data=iris2)


Out[24]:
<matplotlib.axes._subplots.AxesSubplot at 0xa572f16c>

In [26]:
# We can extend this plot by adding a layer of individual points on top of it through seaborn's stripplot
# Use jitter=True so that all the points don't fall in single vertical lines above the species
# Saving the resulting axes as ax each time causes the later plots to be drawn on top of the previous axes
ax = sns.boxplot(x="species", y="petal_length", data=iris2)
ax = sns.stripplot(x="species", y="petal_length", data=iris2, jitter=True, edgecolor="gray")



In [27]:
# A violin plot combines the benefits of the previous two plots and simplifies them.
# Denser regions of the data are fatter, and sparser regions thinner, in a violin plot
sns.violinplot(x="species", y="petal_length", data=iris2, size=5)


Out[27]:
<matplotlib.axes._subplots.AxesSubplot at 0xa57f132c>

In [28]:
# Another useful seaborn plot for looking at univariate relations is the kdeplot,
# which creates and visualizes a kernel density estimate of the underlying feature
sns.FacetGrid(iris2, hue="species", size=5)\
    .map(sns.kdeplot, "petal_length")\
    .add_legend()


Out[28]:
<seaborn.axisgrid.FacetGrid at 0xa583102c>

In [30]:
# Another useful seaborn plot is the pairplot, which shows the bivariate relation between each pair of features
# From the pairplot, we'll see that Iris setosa is separated from the other two across all feature combinations
sns.pairplot(iris2, hue="species", size=3)


Out[30]:
<seaborn.axisgrid.PairGrid at 0xa5538bec>

In [31]:
# The diagonal elements in a pairplot show histograms by default,
# but we can update these elements to show other things, such as a KDE
sns.pairplot(iris2, hue="species", size=3, diag_kind="kde")


Out[31]:
<seaborn.axisgrid.PairGrid at 0xa4bcfacc>

In [14]:
import numpy as np
from sklearn.model_selection import train_test_split

# Hold out 20% of the data as a test set
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=0)

Apply Random Forests


In [15]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=10)
clf.fit(X_train, y_train)


Out[15]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [16]:
clf.predict(X_test)


Out[16]:
array([2, 1, 0, 2, 0, 2, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 2, 1, 0,
       0, 2, 0, 0, 1, 1, 0])

In [17]:
y_test


Out[17]:
array([2, 1, 0, 2, 0, 2, 0, 1, 1, 1, 2, 1, 1, 1, 1, 0, 1, 1, 0, 0, 2, 1, 0,
       0, 2, 0, 0, 1, 1, 0])

In [18]:
# Accuracy on the held-out test set
clf.score(X_test, y_test)


Out[18]:
0.96666666666666667

Using Cross Validation
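
A minimal sketch of what this could look like, scoring the same Random Forest with 5-fold cross validation on the full dataset (reusing the iris object loaded above):

In [ ]:
# A minimal sketch: 5-fold cross validation instead of a single train/test split
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

scores = cross_val_score(RandomForestClassifier(n_estimators=10),
                         iris.data, iris.target, cv=5)
print("mean accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))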

To do

  • Apply some unsupervised learning algorithms on this dataset (see the k-means sketch below).
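
As a possible starting point for that item, a minimal k-means sketch (k=3, matching the known number of species; the cluster labels are arbitrary and won't align with the target encoding):

In [ ]:
# A possible starting point: k-means clustering with k=3
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3, random_state=0)
clusters = kmeans.fit_predict(iris.data)
print(clusters[:10])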

In [ ]: